Linguistically Inspired Language Model Augmentation for MT
Abstract
This article reports on efforts to improve the translation accuracy of a corpus-based Machine Translation (MT) system. An error analysis of past translation outputs indicated that accuracy could likely be improved by extending the coverage of the Target-Language (TL) side language model. The method adopted for improving the language model, which is based on the concatenation of consecutive phrases, is presented first. The algorithmic steps that form the augmentation process are then described. The key idea is to augment the language model only for the most frequent phrase sequences, as counted over a TL-side corpus, so that the new language model entries cover as many cases as possible. Experiments presented in the article show that substantial improvements in translation accuracy are achieved by the proposed method when the augmented language model is integrated into the corpus-based MT system.
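The key idea in the abstract (count sequences of consecutive phrases over a TL-side corpus, then add only the most frequent ones to the language model) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how phrases are represented, how many sequences are kept, or how the language model is stored. The sketch assumes each sentence is already segmented into a list of phrase strings, that the model is a plain set of phrase entries, and that a hypothetical `top_k` cutoff selects the most frequent pairs. The function names are illustrative as well.

```python
from collections import Counter


def count_phrase_pairs(corpus):
    """Count how often each pair of consecutive phrases occurs in the corpus.

    `corpus` is assumed to be a list of sentences, each one a list of
    phrase strings (the segmentation itself is outside this sketch).
    """
    counts = Counter()
    for sentence in corpus:
        for left, right in zip(sentence, sentence[1:]):
            counts[(left, right)] += 1
    return counts


def augment_language_model(model, corpus, top_k):
    """Return a copy of `model` extended with the `top_k` most frequent
    concatenations of consecutive phrases (hypothetical cutoff)."""
    augmented = set(model)
    for (left, right), _count in count_phrase_pairs(corpus).most_common(top_k):
        augmented.add(left + " " + right)
    return augmented


# Toy TL-side corpus, already split into phrases (an assumption).
corpus = [
    ["the old house", "was sold", "last year"],
    ["the old house", "was sold", "yesterday"],
    ["a new car", "was sold", "last year"],
]
model = {phrase for sentence in corpus for phrase in sentence}
augmented = augment_language_model(model, corpus, top_k=2)
```

In this toy corpus only two phrase pairs occur twice, so with `top_k=2` the model gains the entries "the old house was sold" and "was sold last year", while pairs seen once are left out. Restricting growth to frequent sequences keeps the model small while covering the cases most likely to recur in translation.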
Similar resources
Language Model Data Augmentation for Keyword Spotting in Low-Resourced Training Conditions
This research extends our earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech. MT-based data augmentation is applied to two language pairs: English-Lithuanian and English-Amharic. Using filtered N-best MT hypotheses for language modeling is found to perform better th...
Improving the Translation of Discourse Markers for Chinese into English
Discourse markers (DMs) are ubiquitous cohesive devices used to connect what is said or written. However, across languages there is divergence in their usage, placement, and frequency, which is considered to be a major problem for machine translation (MT). This paper presents an overview of a proposed thesis, exploring the difficulties around DMs in MT, with a focus on Chinese and English. The ...
Reversible Template-based Shake & Bake Generation
Corpus-based MT systems that analyse and generalise texts beyond the surface forms of words require generation tools to re-generate the various internal representations into valid target language (TL) sentences. While the generation of word-forms from lemmas is probably the last step in every text generation process at its very bottom end, token-generation cannot be accomplished without structu...
Soft Syntactic Constraints for Hierarchical Phrased-Based Translation
In adding syntax to statistical MT, there is a tradeoff between taking advantage of linguistic analysis, versus allowing the model to exploit linguistically unmotivated mappings learned from parallel training data. A number of previous efforts have tackled this tradeoff by starting with a commitment to linguistically motivated analyses and then finding appropriate ways to soften that commitment...
Stone Soup Translation: the Linked Automata Model
The automated translation of one natural language to another, known as machine translation (MT), typically requires successful modeling of the grammars of the languages and the relationship between them. Rather than hand-coding these grammars and relationships, some machine translation efforts employ data-driven methods, where the goal is to learn from a large amount of training examples of acc...
Publication date: 2016